Abstract

Genome sequencing of closely related individuals has yielded valuable insights that link genome evolution to phenotypic
variations. However, advances in sequencing technology have also led to an escalation in the number of poor-quality
drafted genomes assembled based on reference genomes that can have highly divergent or haplotypic regions. The
self-fertilizing nature of Arabidopsis thaliana is an advantage to sequencing projects because its genome is mostly
homozygous. To determine the accuracy of an Arabidopsis drafted genome in less conserved regions, we performed
a resequencing experiment on an ~371-kb genomic interval in the Landsberg erecta (Ler-0) accession. We identified novel
structural variations (SVs) between Ler-0 and the reference accession Col-0 using a long-range polymerase chain reaction
approach to generate an Illumina data set that has positional information, that is, a data set with reads that map to a known
location. Positional information is important for accurate genome assembly and the resolution of SVs, particularly in highly
duplicated or repetitive regions. Sixty-one regions with misassembly signatures were identified from the Ler-0 draft,
suggesting the presence of novel SVs that are not represented in the draft sequence. Sixty of those were resolved by iterative
mapping using our data set. Fifteen large indels (>100 bp) identified from this study were found to be located either within
protein-coding regions or upstream regulatory regions, suggesting the formation of novel alleles or altered regulation of
existing genes in Ler-0. We propose that future genome-sequencing experiments follow a clone-based approach that
incorporates positional information to ultimately reveal haplotype-specific differences between accessions.
Introduction

The number of genome projects of various scales has in- 
creased substantially over the years due to a reduction in 
sequencing costs as technology advances (Chain et al. 
2009). Many laboratories benefit from this impressive 
technological advancement in terms of rapid generation 
of high-depth sequence data. However, next-generation se- 
quencing (NGS) platforms are compromised in their ability 
to generate long reads. Read length reduction is compen- 
sated by an increase in coverage where 20- to 30-fold 
redundancy has been reported as the acceptable criterion 
by most genome projects (Bentley et al. 2008; Ossowski
et al. 2008). Due to the nature of short-read data sets,
drafted genomes are assembled based on preexisting published
sequences, and the quality of the resulting data has yet
to be sufficiently assessed. This has led to the mass release
of drafted genomes (Chain et al. 2009), many of whose 
qualities are only assessed by identifying the number of as- 
sembly gaps. Other valuable diagnostic criteria such as the 
number of errors and misassemblies are potentially missing 
and can only be revealed with fine-scale analysis. 

Computational algorithms have been developed specifi- 
cally to tackle short-read data sets (Butler et al. 2008; 
Ossowski et al. 2008; Zerbino and Birney 2008; Simpson 
et al. 2009). The identification of single nucleotide polymorphisms
(SNPs; Shen et al. 2010) and small insertion-deletion
polymorphisms (indels; Krawitz et al. 2010) using a combination
of multiple assembly algorithms, each designed and
optimized for a different purpose, had seemed
to be the end goal of genome projects because other forms of
deviation relative to the reference genome remain challenging
to detect. Resolving SVs, that is, changes that are
not single nucleotide variants, such as duplications, inversions,
large indels, and copy number variations (CNV) (Feuk
et al. 2006; Frazer et al. 2009), has proven problematic
for short-read assemblers (Snyder et al. 2010). Prior to
the arrival of NGS technology, comparative genomic hybridization
using oligonucleotide arrays was extensively
used as an analysis tool for the discovery of submicroscopic
SVs (Sebat et al. 2004; Gresham et al. 2008). Recently, several
methods have been developed to detect SVs from NGS
data sets (Korbel et al. 2007; Chen et al. 2009; Snyder et al.
2010). The accuracy of these techniques remains to be sufficiently
tested, particularly on highly complex eukaryotic genomes.
One potential source of assembly error is a tandem
duplication spanning an inversion allele, which
may be interpreted as a de novo complex duplication
if only one inversion haplotype is represented in the
reference genome (Zhang et al. 2009). The lack of strategies
to traverse rearrangements and co-occurrences of
SVs between chromosomal haplotypes can cause assembly
gaps because sequence reads from paralogous regions are mistaken
for allelic overlaps when they map to a single location
(Bailey et al. 2001; Sharp et al. 2006). This problem further complicates
accurate variant calling and may hamper large indel detection
in such regions. Improper placement of scaffolds may
also introduce spurious evolutionary breakpoints that do not exist in the genome
(Lewin et al. 2009).

Arabidopsis thaliana, a flowering plant from the Brassica- 
ceae family, is one of the best studied plant species due to its 
tractability and the number of research tools available. The 
self-compatible nature of Arabidopsis has allowed each ac- 
cession or lineage to evolve independently yielding diverse 
populations that display a multitude of phenotypic varia- 
tions (Koornneef et al. 2004). Several groups have 
embarked on the 1001 A. thaliana genome project (Weigel
and Mott 2009), dedicated to generating genome sequences
from numerous accessions of this species. Comparative
genomics has frequently been used as a tool to study
evolution by natural selection (Feuillet and Keller 2002; 
Nishiyama et al. 2003; Bowman et al. 2007; Koonin 2009). 
By comparing two or more genomes, one can infer how nat- 
ural selection acts in different lineages in driving sequence 
evolution in genes and nongenic regions and how these 
changes relate to phenotypic evolution and adaptation 
(Ellegren 2008). Investigating patterns of divergence around 
known functional elements could yield insights on the effect 
that different forces, for example, purifying selection and
genetic hitchhiking (Cai et al. 2009), have on genetic polymorphisms
(Altshuler et al. 2010).

It has been reported that approximately one quarter of 
the A. thaliana reference genome involves regions that 
are highly divergent with the presence of rare alleles in at 
least one accession (Zeller et al. 2008). Genomic SVs under- 
lie phenotypic differences between A. thaliana accessions 
(Fransz et al. 2000; Meyers et al. 2005; Alonso-Blanco
et al. 2009). SVs are predominantly multigenic or even multilocus
and may not be represented in the reference accession.
The role of SVs in chromosomal speciation has been shown 
in several models (White 1978), an example being the sup- 
pressed-recombination model where a genetic barrier is 
formed between populations. Substitutions linked to these 
rearranged chromosomes cannot be exchanged, thereby 
promoting genomic incompatibilities and hence speciation 
(Rieseberg et al. 1999; Perry et al. 2008; Bikard et al. 2009;
Marques-Bonet et al. 2009; Alcazar et al. 2010). Complex 
SVs also promote genome instability by long-distance non- 
allelic homologous recombination leading to further CNV 
(Johnson et al. 2006). Orthologous regions enriched with 
ancestral segmental duplications may serve as hot spots 
for constant genomic turnover, and recurrent CNV genesis 
happens as a result of evolutionarily shared duplications oc- 
curring across and within species (Perry et al. 2008). 

The stream of drafted genomes released has far outnum- 
bered the small group of high-quality genomes (Chain et al. 
2009). Downstream comparative genomics heavily depends 
on the fidelity of these drafts. A poor-quality draft is therefore
prone to misinterpretation (Choi et al. 2008; Meader
et al. 2010). Here, we performed a fine-scale assessment of
the Landsberg erecta (Ler-0) drafted genome at a selected 
polymorphic locus. We identified and resolved novel SVs in 
a contiguous Ler-0 locus using high-coverage Illumina reads
that were generated from an experimental method that 
incorporates positional information. This work not only 
highlights the importance of rectifying errors in drafted genomes
before they are used in downstream applications but
also provides an unprecedented view of genomic divergence
in an inbred species. We propose that future genome
projects proceed in a manner that incorporates positional
information in order to improve genome assembly and to
reveal large deviations from reference genomes.



Materials and Methods 

Genomic DNA Extraction 

Arabidopsis thaliana seed stocks for the Ler-0 accession
were obtained from the Nottingham Arabidopsis Stock
Centre (ID: NW20). High-quality genomic DNA suited for
long-range (LR)-polymerase chain reaction (PCR) amplification
was extracted from 21-day-old frozen leaf material
according to the modified method of van der Biezen (van
der Biezen et al. 1996). Four grams of leaf tissue was ground
in liquid nitrogen and vortexed in 25 ml chilled extraction
buffer (0.35 M sorbitol, 0.1 M Tris-HCl, 5 mM ethylenediaminetetraacetic
acid [EDTA], pH 7.5, 20 mM Na2S2O5). The
crude extract was centrifuged at 14,000 revolutions per
minute (rpm) for 1 h at 4 °C, and the supernatant was discarded.
The pellet was dissolved in 1.25 ml of extraction buffer, 1.75 ml of nucleus lysis
buffer (0.2 M Tris-HCl, 50 mM EDTA, 2 M NaCl, 2% hexadecyl-trimethyl-ammonium
bromide, pH 7.5), and 0.6 ml of
5% sarkosyl. The mixture
was subsequently incubated for 1 h at 65 °C. Chloroform/isoamylalcohol
(24:1 v/v) extraction was performed
by adding 7.5 ml of the solvent mixture to the tube, followed
by centrifugation at 14,000 rpm for 15 min. Clear
supernatant was transferred to a clean tube, and DNA
was precipitated with an equal volume of chilled isopropanol
and incubated on ice for 20 min before centrifugation at
14,000 rpm for 15 min. The isopropanol was decanted, and
the pellet was washed with 70% ethanol and air dried for 20
min. The pellet was dissolved in 500 µl Tris-ethylenediaminetetraacetic
acid (TE) buffer containing 10 µl of 10 mg/ml
RNaseA. Genomic DNA was stored at 4 °C to avoid repeated
freeze-thaw cycles that might hamper LR-PCR amplifications.

LR-PCR Amplification and Illumina Sequencing

Primers for LR-PCR were designed using Primer3 (Rozen and
Skaletsky 2000) to amplify overlapping genomic fragments
of 647-13,702 bp, spanning an ~371-kb contiguous locus
in Ler-0 (supplementary table 2A, Supplementary Material online).
LR-PCR amplifications (milliQ water: 75.6 µl; 10× buffer:
10 µl; deoxyribonucleotide triphosphate [2.5 mM]: 8 µl; forward
primer [10 µM]: 2 µl; reverse primer [10 µM]: 2 µl;
high-fidelity Takara ExTaq enzyme [5 units/µl]: 0.4 µl; DNA template
[90 ng/µl]: 2 µl for a 100-µl reaction) were performed using
an autosegment extension program (3 min 94 °C/30 s 94 °C, 30
s 62 °C, 5-10 min 68 °C, 30× cycles/5 min 68 °C), increasing the
extension time by 15 s each cycle after 14 cycles in the PalmCycler.
PCR products were separated on 0.8% (for fragments
larger than 10 kb) or 1.0% (for fragments smaller than 10 kb)
1× Tris-acetate-EDTA gel for amplicon size confirmation, followed
by purification using the QIAquick PCR Purification
Kit. The concentration of each purified PCR product was quantified.
Ler-0 amplicons were pooled in equal molarity to yield DNA at
a concentration of 5 µg/50 µl TE. Sequencing was performed
on the GAII to generate a 75-bp single-read data set.
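Pooling amplicons "in equal molarity" means that each amplicon contributes the same number of molecules, so the mass taken from each amplicon scales with its length. The following is a minimal sketch of that arithmetic, not the authors' protocol; the function name and the two example amplicon labels are hypothetical, and the lengths are simply the smallest and largest fragment sizes quoted above.

```python
def equimolar_masses_ng(amplicon_lengths_bp, total_mass_ng):
    """Split a total DNA mass across amplicons so that each one
    contributes the same number of molecules.

    For double-stranded DNA, moles are proportional to mass / length,
    so equal moles implies mass proportional to length.
    """
    total_length = sum(amplicon_lengths_bp.values())
    return {
        name: total_mass_ng * length / total_length
        for name, length in amplicon_lengths_bp.items()
    }


# Hypothetical two-amplicon pool using the extreme fragment sizes from the text,
# scaled to the 5 µg (5,000 ng) pool described above.
pool = equimolar_masses_ng({"shortest": 647, "longest": 13702}, 5000.0)
for name, mass in pool.items():
    print(f"{name}: {mass:.1f} ng")
```

In a real pool of 49 amplicons the same proportionality applies; only the dictionary of lengths changes.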

Pipeline Analysis and Read Trimming 

Illumina Pipeline version 1.6 was used for pipeline analysis.
Off-Line Basecaller programs, Firecrest and Bustard, were 
used for image analysis and base calling, respectively. Ap- 
proximately 87.9% of clusters passed filtering. The GERALD 
module in CASAVA 1.6 was used to combine tile-based
.qseq files into a single .txt file. File conversion from .qseq
to .fastq was done using the SSAKE (Warren et al. 2007) qseq2fastq.pl
script. Reads were trimmed according to a Phred
score of 20 using the TQSfastq.py script. SSAKE was further
utilized to generate de novo contigs under the following parameters:
m: 15 (minimum number of overlapping bases
with the seed during overhang consensus build up) and
x: 15 (minimum overlap between contigs to merge adjacent
contigs in a scaffold).
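To illustrate what trimming at a Phred score of 20 involves, the sketch below keeps the longest contiguous stretch of a read in which every base has a quality of at least 20. This is an illustrative assumption about the trimming rule, not a reproduction of the TQSfastq.py script, and the ASCII offset of 64 is likewise assumed here as the quality encoding of this pipeline generation.

```python
def longest_high_quality_run(sequence, quality, threshold=20, offset=64):
    """Return the longest contiguous part of `sequence` whose bases all
    have Phred quality >= threshold.

    `quality` is the FASTQ quality string; `offset` is the ASCII offset
    of the encoding (64 is an assumption, 33 is the other common value).
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have equal length")
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, symbol in enumerate(quality):
        if ord(symbol) - offset >= threshold:
            if run_len == 0:
                run_start = i  # a new high-quality run begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # a low-quality base ends the current run
    return sequence[best_start:best_start + best_len]


# 'h' encodes Q40 and 'B' encodes Q2 under the assumed offset of 64.
print(longest_high_quality_run("ACGTACGT", "hhBhhhhB"))
```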

Detection of Misassemblies and Variant 
Identification 

Trimmed reads were assembled to either Col-0 reference or 
Ler-0 draft sequence using Geneious assembler (Drummond 
et al. 2010) by allowing 4-6 mismatches and 5-50 bp gaps
to account for indels. Misassemblies were identified by 
detecting aberrant assembly signatures in Geneious. Two 
hundred bases at the left and right flanks of the ambiguous 
regions were extracted and used as references for targeted 
iterative read mapping described in the Results section. 
A SHORE consensus analysis (Ossowski et al. 2008) was per- 
formed to obtain GC content and errors in read positions. 
SNPs were identified using the Find Variations/SNPs option 
in Geneious by setting the minimum coverage parameter to 
100 and minimum variant frequency parameter to 0.8. 
Locus alignments of Col-0, Ler-0 draft, and Ler-0 revised se- 
quences were generated using the progressiveMauve 
aligner (Darling et al. 2010). 
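The two SNP-calling thresholds above (minimum coverage of 100 and minimum variant frequency of 0.8) can be expressed as a simple filter over per-position base counts. The sketch below is a hedged illustration of those two thresholds only; it does not reproduce the internals of the Geneious Find Variations/SNPs option, and the pileup data structure is a hypothetical one chosen for the example.

```python
def call_snps(pileup, min_coverage=100, min_variant_frequency=0.8):
    """Filter candidate SNPs by coverage and variant frequency.

    `pileup` is an iterable of (position, reference_base, counts), where
    `counts` maps each observed base to the number of reads supporting it.
    Returns a list of (position, reference_base, variant_base, frequency).
    """
    snps = []
    for position, ref_base, counts in pileup:
        coverage = sum(counts.values())
        if coverage < min_coverage:
            continue  # too few reads to make a confident call
        # Most frequent non-reference base at this position, if any.
        alt_base, alt_count = max(
            ((base, n) for base, n in counts.items() if base != ref_base),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if alt_base is None:
            continue
        frequency = alt_count / coverage
        if frequency >= min_variant_frequency:
            snps.append((position, ref_base, alt_base, frequency))
    return snps


# Hypothetical counts: only the first position passes both thresholds.
example = [
    (10, "A", {"A": 5, "G": 195}),
    (20, "C", {"C": 150, "T": 50}),
    (30, "G", {"T": 50}),
]
print(call_snps(example))
```

The high frequency threshold suits a largely homozygous, self-fertilizing accession, where true variants are expected to be supported by nearly all reads.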

Validation of SVs by Sanger Sequencing 

Several resolved indels were randomly chosen for validation 
by Sanger sequencing. Genomic DNA was PCR amplified using
primers designed by Primer3. Both Ler-0 and Col-0 alleles
were amplified, and size differences were visualized on an 
agarose gel. The Ler-0 allele was subjected to dideoxy 
sequencing by the ABI-Sanger instrument followed by align- 
ment of the sequence trace to the iteratively resolved indel 
for validation purposes. 

Data Deposition 

Ler-0_chromosome_3_locus.fasta (GenBank: HQ698308). 

Results 

LR-PCR Amplification of a Polymorphic Ler-0 
Genomic Interval 

An ~371-kb genomic interval on chromosome 3 (Col-0 position:
16653794-17025087) that spans six Col-0 bacterial
artificial chromosomes (BACs), that is, F18N11, F9K21, 
T6D9, F16L2, F12M12, and F18L15, was selected for 
the study of the prevalence of SVs between a reference 
(Col-0) and a nonreference (Ler-0) Arabidopsis accession. 
LR-PCR was used to amplify overlapping genomic fragments 
Fig. 1. — LR-PCR amplification and large indel polymorphisms between Ler-0 and Col-0. Ler-0 locus that corresponds to 16653794-17025087
positions on Col-0 chromosome 3 is amplified in 49 overlapping fragments using LR-PCR. (A) Several examples of LR-PCR amplicons are shown on the
gel. (B) Gel image depicts large indel polymorphisms between Ler-0 and Col-0. Amplicon identifier: L, Ler-0 allele; C, Col-0 allele; PX, primer identifier.
(C) Illustration of locus-specific genetic architecture between Ler-0 and Col-0. Putative large indels are represented as red blocks and six unamplified
gaps as gray blocks.



within the locus with amplicon sizes ranging from 647 to 
13,702 bp (fig. 1A). LR-PCR was performed in two steps.
In the first step, 40 primer pairs were used for amplification, 
and we were able to obtain 29 out of 40 amplicons. In the 
second step, an additional 26 primer pairs were designed to 
divide regions that were not obtained in the first round into 
2 or 3 smaller fragments (supplementary table 2A, Supple- 
mentary Material online). The second round of amplification 
was crucial to rule out failed amplification caused
by misannealing of the first primer pairs to polymorphic Ler-0
sites, because Col-0 was used as the reference for primer design.
From the second round, 20 additional amplicons were
obtained. The entire locus is spanned by 49 amplicons, with
six gaps that were not covered by PCR (fig. 1C). In an
attempt to bridge the gaps, additional primers that spanned 
those gaps were designed. However, we were still unable to 
obtain any amplicon for the six gaps, suggesting the pres- 
ence of large insertions in these regions that are beyond am- 
plifiable range, that is, larger than 25 kb (data not shown). 
The locus of study is partitioned into 49 amplicons that rep- 
resent genomic fragments obtained from known locations 
and thus having positional information. Comparison be- 
tween Col-0 and Ler-0 amplicon lengths revealed that at 
least six amplicon pairs harbor large indel polymorphisms 
(fig. 1B). We hypothesized that in addition to these six large
indels, a considerable number of indels of significant size
remained undetected due to limited gel resolution. Overall,
the PCR results suggest that large SVs exist within the se- 
lected genomic region between the two Arabidopsis acces- 
sions. 
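Because every amplicon has a known position, uncovered stretches of the locus can be found by walking the sorted amplicon intervals. The sketch below is an illustration under assumed inputs: the function name is hypothetical and the coordinates are toy values, not the primer positions from supplementary table 2A.

```python
def uncovered_gaps(locus_start, locus_end, amplicons):
    """Return the parts of [locus_start, locus_end] not covered by any amplicon.

    `amplicons` is a list of (start, end) pairs with inclusive coordinates.
    The result is a list of (gap_start, gap_end) pairs, also inclusive.
    """
    gaps = []
    cursor = locus_start  # first position not yet known to be covered
    for start, end in sorted(amplicons):
        if start > cursor:
            gaps.append((cursor, min(start - 1, locus_end)))
        cursor = max(cursor, end + 1)
        if cursor > locus_end:
            break
    if cursor <= locus_end:
        gaps.append((cursor, locus_end))
    return gaps


# Toy locus of 100 bp tiled by three amplicons, two of which overlap.
print(uncovered_gaps(1, 100, [(1, 30), (25, 50), (61, 100)]))
```

Applied to the Col-0 interval 16653794-17025087 with the real primer coordinates, the same walk would list the unamplified regions in Col-0 coordinates.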

High-Coverage Sequencing of Amplicons to Detect 
Interaccession SVs 

Illumina Sequencing and Read Mapping to the
Col-0 Reference. To precisely capture the sequence context
of these SVs, we proceeded to sequence the ~371-kb
contiguous locus in Ler-0 using the Illumina Genome Analyzer
II platform. Ler-0 amplicons were pooled in equal molarity
(supplementary fig. 1A, Supplementary Material online)
and sequenced to generate a 75-bp single-read data set (supplementary
fig. 1B, C, and D, Supplementary Material online)
with positional information. The filtered and quality-trimmed
reads were assembled to the Col-0 reference locus by allowing
up to four mismatches and gaps of up to 50 bp to
permit small indel detection. Using the Geneious software
(Drummond et al. 2010), misassembly signatures, indicative
of SVs, were identified. Deletions in Ler-0 were seen
as gaps in the assembly and insertions as arrays of consecutive
mismatches (supplementary fig. 2, Supplementary
Material online). To locate the six large indels
observed from differences in Col-0/Ler-0 amplicon
lengths (fig. 1B), all primer sequences were aligned to the
Col-0 reference. Misassembly signatures found between the
flanking primers of the six large-indel amplicons
confirmed the occurrence of indels in Ler-0.
In addition, 123 non-SNP misassembly signatures were
found (excluding the six unamplified gaps), corroborating
our initial speculation on the presence of additional
indels that fall below the range of gel-based detection.

Read Mapping to the Ler-0 Draft. The Wellcome Trust
Centre for Human Genetics (WTCHG) has generated an Ler-0
draft genome from 36- to 51-bp paired-end Illumina libraries
of approximately 40-fold coverage. As part of our analysis, we
subsequently used the Ler-0 draft as the reference for read mapping,
based on the assumption that the draft sequence would be
a better reference than Col-0. In parallel, this analysis also
serves as an indicator of Ler-0 draft sequence quality. Reads were
assembled to the Ler-0 draft by allowing up to six mismatches
and 5-bp gaps. The number of non-SNP misassembly signatures
was reduced from 123 to 61
when the draft sequence was used. However, the large
SVs detected from PCR amplicon sizing were not represented
in the Ler-0 draft sequence (table 1). The Ler-0 draft was generated
by a combination of de novo assembly and reference-based
mapping. Hence, a large pool of de novo contigs could
not be incorporated in the draft due to lack of sequence context
from the Col-0 reference and the lack of positional information
for these contigs. Therefore, it is expected that SV sequence information
remained in the pool of unmapped contigs.

Improving Local Assembly to Reveal SVs Using 
Targeted Iterative Read Mapping. The PCR-based ap- 
proach provides us with the information that reads obtained 
originate from the target locus and not from other genomic 
regions. We assumed that the pool of unmapped reads 
(~7%) accounted for SVs. Geneious assembler was used
to perform a targeted iterative read-mapping step to map 
these reads to their designated regions. Each misassembled 
region was flagged, and their left and right flanking sequen- 
ces were extracted for iterative mapping. Iterative read map- 
ping consists of the following five steps (fig. 2): 1) Extract 
200-bp sequences that flank the misassembled region 
(these flanks serve as reference sequences for subsequent 
iterative mapping). 2) Map all reads to both flanks indepen- 
dently. 3) After each round of iteration, reads that 
assembled to the border of the flank will have sequences 
extended beyond this flank. The extended sequence is then 
incorporated to the border of the initial flank to produce 
a longer flank that is the combination of the initial flank 
and the assembled read sequence (approximately 45-50 
bp for each iteration). Reads are then remapped to the 
new reference flank. 4) Repeat steps 2 and 3 until the left 
and right iteratively "extended" flanks overlap and can be 
aligned. 5) Incorporate the new local consensus sequence 



Table 1

Ler-0 Amplicon Size Estimates Correlate with the Actual Lengths in the
Ler-0 Revised Sequence

Primer ID   Gel-Estimated Ler-0     Col-0 Length   Ler-0 Draft    Ler-0 Revised
            Amplicon Length (bp)    (bp)           Length (bp)    Length (bp)
P12         13,500                  9,816          9,816          14,526
P19         12,000                  6,811          6,989          13,389
P28         11,000                  8,683          8,755          12,285
P17A2       7,000                   2,084          2,084          6,782
P22         5,000                   13,702         13,741         5,691
P31B        1,300                   4,282          4,340          1,248

Note. — Ler-0 amplicon lengths were estimated on an agarose gel, and Col-0
lengths were obtained from TAIR. The corresponding lengths of these amplicons
were determined by mapping flanking primer sequences to the Ler-0 draft and Ler-0
revised sequence. PX, primer identifier.



into the reference sequence followed by realigning all orig- 
inal reads to the modified reference. 
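The five steps above can be sketched as a small extension loop. This is a simplified illustration rather than the Geneious-based procedure used in the study: it assumes exact, forward-strand matches, extends only the left flank toward the right flank, and uses hypothetical function names. It also stops when two reads propose conflicting extensions, which mirrors the ambiguity that prevents further iteration in duplicated regions.

```python
def candidate_extensions(flank, reads, min_overlap):
    """Collect the bases that each read adds beyond the end of `flank`."""
    extensions = set()
    for read in reads:
        longest_possible = min(len(read) - 1, len(flank))
        for overlap in range(longest_possible, min_overlap - 1, -1):
            if flank.endswith(read[:overlap]):
                extensions.add(read[overlap:])
                break
    return extensions


def extend_once(flank, reads, min_overlap=15):
    """One iteration (step 3): return (new_flank, status)."""
    extensions = candidate_extensions(flank, reads, min_overlap)
    if not extensions:
        return flank, "stalled"
    longest = max(extensions, key=len)
    # Every proposed extension must agree with the longest one.
    if any(not longest.startswith(ext) for ext in extensions):
        return flank, "ambiguous"
    return flank + longest, "extended"


def join_overlap(left, right, min_overlap):
    """Merge the two flanks if the end of `left` overlaps the start of `right`."""
    for overlap in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:overlap]):
            return left + right[overlap:]
    return None


def bridge(left_flank, right_flank, reads, min_overlap=15, max_rounds=100):
    """Repeat extension (step 4) until the flanks overlap, else return None."""
    current = left_flank
    for _ in range(max_rounds):
        joined = join_overlap(current, right_flank, min_overlap)
        if joined is not None:
            return joined  # local consensus spanning the misassembled region
        current, status = extend_once(current, reads, min_overlap)
        if status != "extended":
            return None  # flag the region as unresolved
    return None


# Toy example: 10-bp reads tiled across a 28-bp sequence with 8-bp flanks.
truth = "ACGTTGCAGGATCCTAAGCTTTCAGGCA"
toy_reads = [truth[i:i + 10] for i in range(len(truth) - 9)]
print(bridge(truth[:8], truth[20:], toy_reads, min_overlap=5) == truth)
```

A fuller version would tolerate mismatches, handle reverse-complement reads, and extend from the right flank as well, as described in the text.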

Manual iterative steps allowed us to pinpoint problematic 
regions that could not be resolved by automated assembly 
programs. In a particular iterative step when there was more 
than one possible read option for subsequent contig extension
(fig. 3A), we could not proceed onto the next iteration.
Instead, iterative read mapping was performed from the op- 
posite flank until it could be aligned to the previous flank. 
Regions were flagged as unresolved when more than one 
read option was obtained from both left and right flank ex- 
tensions, as selecting any one of these possible read options 
would ultimately result in an incorrect final consensus 
sequence. This step is crucial to prevent the generation of 
incorrect chimeric contigs that occur when attempting to 
assemble duplicated or conserved regions. By referring to 
each amplicon size, the newly assembled sequence can 
be cross-checked with the estimated PCR product length. 
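The cross-check against amplicon size can be written as a relative deviation between an assembled length and the gel estimate. The sketch below uses the values from table 1; the 15% tolerance is a hypothetical choice for illustration, since gel-based sizing is approximate and the text does not state a cutoff.

```python
# (gel estimate, Ler-0 draft length, Ler-0 revised length) in bp, from table 1.
TABLE_1 = {
    "P12": (13500, 9816, 14526),
    "P19": (12000, 6989, 13389),
    "P28": (11000, 8755, 12285),
    "P17A2": (7000, 2084, 6782),
    "P22": (5000, 13741, 5691),
    "P31B": (1300, 4340, 1248),
}


def relative_deviation(assembled_bp, gel_estimate_bp):
    """Absolute difference as a fraction of the gel-estimated length."""
    return abs(assembled_bp - gel_estimate_bp) / gel_estimate_bp


def flag_inconsistent(lengths, tolerance=0.15):
    """Return primer IDs whose assembled length deviates from the gel
    estimate by more than `tolerance` (an assumed, illustrative cutoff)."""
    return [
        primer
        for primer, (gel, assembled) in lengths.items()
        if relative_deviation(assembled, gel) > tolerance
    ]


draft = {p: (gel, d) for p, (gel, d, _) in TABLE_1.items()}
revised = {p: (gel, r) for p, (gel, _, r) in TABLE_1.items()}
print("draft flagged:", flag_inconsistent(draft))
print("revised flagged:", flag_inconsistent(revised))
```

Under this assumed tolerance, all six draft lengths disagree with the gel estimates while none of the revised lengths do, which is consistent with the title of table 1.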

Out of the 61 misassembled regions, 57 were resolved 
(supplementary table 1, Supplementary Material online) 
by local iterative mapping, whereas the remaining four re- 
gions could not be confidently determined. For the first 
three regions, more than one option in iterative extensions 
from both flanks was present (fig. 3A). Nevertheless, initial
iterative results suggested the presence of duplications in 
these regions. We subsequently attempted to resolve these 
regions by making use of de novo contigs generated by 
SSAKE (Warren et al. 2007) using only unmapped reads. 
In the first two regions, a single de novo contig mapped 
to each of the corresponding iterative flanks. The contigs 
were incorporated into the flanks, and iterative mapping 
was performed to validate the contig sequence. In the third 
region, more than one contig mapped to the flanks, and the 
correct one could therefore not be confidently identified 
without further analysis. Thus, 2 out of the 3 regions were 
resolved by de novo contig mapping combined with an it- 
erative validation step. In the fourth region, we encountered 
stretches of long CT-AG inverted repeat sequences from 
Fig. 2. — Draft sequence correction by iterative read mapping. (A) An insertion site is identified by detecting misassembly patterns as described in
supplementary figure 2 (Supplementary Material online). Left and right sequences that flank the incorrect region on the draft are used as references for
local iterative read mapping. (B) In this particular case, two rounds of iterative mapping from both flanks are sufficient to span the insertion. (C)
Alignment between iteratively extended left and right flanks.



both left and right flank directions (fig. 3B). This region is 
estimated to be 2 kb in length by cross-checking to its corresponding
amplicon size (P19 in table 1). An ~1-kb de
novo contig flanked by CT and AG sequences was identified
and was confirmed to be present within the region by
restriction digestion on the PCR amplicon from this region.



Because the CT-GA repeats extended beyond the read 
length (reads that consist entirely of these dinucleotide 
repeat sequences were identified), the actual length of 
the repeats could not be deduced. Nevertheless, the results 
suggest that the total length of the combined CT and AG
repeats is close to 1 kb. Repeat expansion has been found 
Fig. 3. — Limitations of iterative read mapping. (A) Figure illustrates more than one possible read option obtained during iterative mapping.
Iterative extension is then performed from the opposite flank to prevent the generation of chimeric contigs. (B) Figure depicts a stretch of long inverted 
dinucleotide repeat in Ler-0 that is absent from the Col-0 genome. Further iterative steps are not possible in this region as repeat length is longer than 
the read length. This region is estimated to be 2 kb in length based on PCR amplicon size. 
Table 2

Comparative Analysis of Col-0, Ler-0 Draft (WTCHG), and Ler-0
Revised Sequence Using Locus-Specific and Whole-Genome Ler-0
Reads

Reference             No. of Aligned Locus-Specific   No. of Aligned WTCHG Whole-Genome
                      Ler-0 Reads (Mean Coverage)     Ler-0 Reads (Mean Coverage)
Col-0                 2,002,286 (375.4)               161,312 (16.1)
Ler-0 draft (WTCHG)   3,096,868 (595.5)               210,349 (21.9)
Ler-0 revised         3,432,240 (643.6)               220,178 (22.5)

Note. — Locus-specific reads and whole-genome reads are aligned to the Ler-0
draft and revised sequences.

to cause environment-dependent genetic defects in Arabi- 
dopsis (Sureshkumar et al. 2009). Interestingly, TAIR Blast 
(http://arabidopsis.org/Blast/index.jsp) revealed that this 
stretch of long inverted repeats (CT and GA) is not found
anywhere in the Col-0 genome, hence not represented in 
the Ler-0 draft sequence either. In total, 60 (98.4%) out 
of 61 misassembled sites were resolved to generate a revised 
Ler-0 sequence of 375,893 bp in length. The largest inser- 
tion and deletion resolved by iterative read mapping were 
4,819 and 5,139 bp, respectively. By accounting for the size 
of the six unamplified gaps, the Ler-0 locus was estimated to 
be considerably larger than its Col-0 counterpart. 



To evaluate the accuracy of the Ler-0 revised sequence, 
locus-specific reads were mapped to all three sequences 
(Col-0, Ler-0 draft, and Ler-0 revised) using the most strin- 
gent parameters (no gaps and no ambiguities were allowed). 
Because only reads that have no errors were included, the 
mean coverage decreased from ~930-fold (when one error
is allowed) to ~643-fold (only perfect reads allowed). The 
same process was repeated using Ler-0 whole-genome reads 
from WTCHG. In comparison to the Ler-0 draft, the Ler-0 re- 
vised sequence is a better reference (table 2). From the 
stringent alignment of locus-specific reads to the Ler-0 re- 
vised sequence, seven gaps were identified (six gaps 
corresponding to unamplified regions and one gap to the 
aforementioned unresolved region). Similarly, seven gaps 
were present when Ler-0 whole-genome reads were aligned 
to the revised sequence. However, size differences were 
observed in the gaps when either locus-specific reads or 
whole-genome reads were used. This is due to the fact that 
PCR primers were designed to amplify regions that are 
spanned by the forward and reverse primers. However, 
variations in Ler-0 do not always start and end at the 
primer-binding sites. The absence of misassembly signatures 
overall demonstrates that the Ler-0 revised sequence is supe- 
rior to the draft sequence. Moreover, Sanger sequencing on 
16 random corrections subsequently confirmed that all were
Fig. 4. — Schematic diagram of polymorphisms on the selected Ler-0 locus. Variations between Ler-0 and Col-0 are indicated on the diagram. (A)
Figure illustrates the pairwise alignment between Ler-0 and Col-0. TAIR10 annotated genes (green arrows) and transposable element genes (red arrows)
are indicated, respectively. Detailed representations of the variations between Ler-0 and Col-0, (B) SNPs, (C) insertions, and (D) deletions, are indicated in
the diagram.
Table 3

Large Indels that Overlap Genes and Regulatory Regions

Figure ID                                                  Col-0 Gene         Gene Description (a)
Figure 5A                                                  (TAIR:At3G45500)   RING/U-box protein with C6HC-type zinc finger
Figure 5B                                                  (TAIR:At3G45490)   RING/U-box superfamily protein
Figure 5C                                                  (TAIR:At3G45840)   Protein binding/zinc ion binding
Figure 5D                                                  (TAIR:At3G45955)   tRNA-Val
Figure 5E                                                  (TAIR:At3G46110)   Unknown protein
Figure 6A                                                  (TAIR:At3G45990)   Cofilin/tropomyosin-type actin-binding protein
Supplementary figure 4A (Supplementary Material online)    (TAIR:At3G46060)   Small GTP-binding protein
Supplementary figure 4B (Supplementary Material online)    (TAIR:At3G45910)   Unknown protein
Supplementary figure 4C (Supplementary Material online)    (TAIR:At3G45540)   RING/U-box protein with C6HC-type zinc finger
                                                           (TAIR:At3G45550)   Non-LTR retrotransposon family (LINE)
                                                           (TAIR:At3G45555)   Zinc finger (C3HC4-type RING finger) family protein
Supplementary figure 4D (Supplementary Material online)    (TAIR:At3G45750)   Nucleotidyltransferase family protein
                                                           (TAIR:At3G45755)   Transposable element gene
                                                           (TAIR:At3G45760)   Nucleotidyltransferase family protein
Supplementary figure 4E (Supplementary Material online)    (TAIR:At3G45673)   Unknown protein

Note. — LTR, long terminal repeat.
(a) Information obtained from the TAIR10 genome annotation.



accurate (supplementary table 2B, Supplementary Material 
online). 
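The mean coverage values in table 2 can be related to read counts with the usual definition of coverage: aligned bases divided by reference length. The sketch below assumes that definition, which the text does not state explicitly, and uses it to back-calculate the mean aligned read length implied by the reported numbers for the revised sequence (3,432,240 reads, 643.6-fold coverage, 375,893 bp).

```python
def mean_coverage(n_reads, mean_read_length_bp, reference_length_bp):
    """Mean coverage as total aligned bases divided by reference length."""
    return n_reads * mean_read_length_bp / reference_length_bp


def implied_read_length(n_reads, coverage, reference_length_bp):
    """Mean aligned read length implied by a reported coverage value."""
    return coverage * reference_length_bp / n_reads


# Values for the Ler-0 revised sequence, taken from table 2 and the text.
length = implied_read_length(3_432_240, 643.6, 375_893)
print(f"implied mean aligned read length: {length:.1f} bp")
```

The implied length falls a few bases short of the 75-bp raw read length, which is what one would expect after quality trimming.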

To investigate the feasibility of our method for low-coverage 
whole-genome data sets, targeted iterative read assembly was 
performed on a random unmapped contig obtained from 
WTCHG's Ler-0 N50 de novo contigs. Using the Ler-0
whole-genome reads from WTCHG, which have a modest
coverage of 40-fold, a selected 478-bp contig was iteratively
extended to a 2,091-bp sequence. This sequence was validated
by Sanger sequencing (Lai AG, Dijkwel PP, unpublished
data) and does not align to any region of the Ler-0 draft, sug- 
gesting that it is present within a haplotype-specific insertion in 
Ler-0. In an attempt to fill in the six unamplified gaps in the 
locus of interest, iterative read mapping was done using 
WTCHG Ler-0 whole-genome reads as well as the de novo 
contigs. However, we were mostly unsuccessful for several rea- 
sons. The relatively low-coverage data set along with the lack 
of read positional information did not allow accurate iterative 
mapping particularly when the region is duplicated or is highly 
repetitive. Furthermore, because of the lack of positional infor- 
mation, the correct de novo contig that aligns to the border of 
the gap could not be selected when there is more than one 
possible match. 
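
The idea of targeted iterative extension can be illustrated with a minimal sketch. It is not the pipeline used in this study: it assumes error-free reads on a single strand and exact seed matching, and the function name and the `seed_len` and `max_rounds` parameters are hypothetical. It also shows why extension stalls in duplicated or repetitive regions, where reads that share the seed disagree on what follows.

```python
def extend_contig(contig, reads, seed_len=8, max_rounds=100):
    """Greedily extend `contig` rightwards using reads that overlap its 3' end.

    Illustrative only: exact matching, no sequencing errors, one strand.
    """
    for _ in range(max_rounds):
        seed = contig[-seed_len:]
        # Collect the bases that each overlapping read adds beyond the seed.
        extensions = set()
        for read in reads:
            pos = read.find(seed)
            if pos != -1:
                tail = read[pos + seed_len:]
                if tail:
                    extensions.add(tail)
        if not extensions:
            break  # no read reaches past the current contig end
        longest = max(extensions, key=len)
        # Stop if reads disagree: every extension must be a prefix of the longest.
        if any(not longest.startswith(ext) for ext in extensions):
            break  # conflicting extensions, e.g. a repeat or duplicated region
        contig += longest
    return contig


# Toy example: overlapping 16-bp reads tiled across a 36-bp sequence.
genome = "ACGTTGCAAGGCTTACGATCCGATTAGGCATCGAAC"
reads = [genome[i:i + 16] for i in range(0, len(genome) - 15, 4)]
extended = extend_contig(genome[:12], reads)
```

In this toy case the 12-bp starting contig is extended to the full 36-bp sequence, whereas two reads that share the seed but continue differently halt the extension, which mirrors the ambiguity described above when positional information is missing.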

The previously predicted SVs were resolved by iterative 
read mapping using a high-coverage data set aided by 
PCR-based positional information. In total, 31 large indels 
(>100 bp), 52 smaller (<100 bp) indel-like misassemblies,
and 722 novel SNPs that were not present in the Ler-0 draft 
sequence were identified. On average, one SNP per 97 bp 
(10 SNPs/kb) and one indel per 507 bp (2 indels/kb) were 
detected between Ler-0 and Col-0. Novel variations identi- 
fied from this study were not represented in the Ler-0 draft 
presumably because they occurred in duplicated or highly
conserved regions, which can hamper accurate
variant calling. Alignment between the Ler-0 and Col-0 loci
yielded a pairwise identity of 84.2%. In addition, we provide 
a snapshot of variations between Ler-0 and Col-0 at this small 
genomic interval (fig. 4, supplementary fig. 3 and table 1, 
Supplementary Material online). 
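
The per-kilobase densities quoted above follow directly from the average spacing between variants. The short calculation below is only an arithmetic check; the function names are hypothetical, and the locus-wide count assumes, for illustration, that the full 371-kb interval was sequenced, which ignores the unamplified gaps.

```python
def variants_per_kb(bp_per_variant):
    """Convert an average spacing (bp per variant) into a density per kb."""
    return 1000.0 / bp_per_variant


def expected_count(locus_bp, bp_per_variant):
    """Approximate number of variants expected across a locus of given size."""
    return locus_bp / bp_per_variant


snp_density = variants_per_kb(97)      # about 10.3 SNPs per kb
indel_density = variants_per_kb(507)   # about 1.97 indels per kb
indels_in_locus = expected_count(371_000, 507)  # roughly 730 indels
```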

Biological and Evolutionary Significance of SVs 

We next determined whether the SVs could have effects on 
genes. According to TAIR10 annotation, the 371-kb locus on
chromosome 3 comprises 102 genes, 3 transfer RNA genes, 
and 6 transposable element genes (fig. 4A). Fifteen novel
large indels were found to be present within genes and reg- 
ulatory regions (table 3). These large indels are grouped into 
three categories: 1) SVs that alter predicted open reading 
frames, 2) SVs located in regulatory regions, and 3) SVs 
affecting clusters of genes with similar functions (fig. 5; 
supplementary fig. 4, Supplementary Material online). 

In the first category, SVs were found either to disrupt
genes or, as observed in several cases, to produce predicted
new transcripts. Figure 5A depicts a copia-like retrotransposon
insertion within the second intron of (TAIR:At3G45500),
whereas figure 5B illustrates a transposon insertion before
the first exon of (TAIR:At3G45490). In another example,
an 812-bp deletion was identified in a 3.4-kb Col-0 cofilin/
tropomyosin-type actin-binding gene (TAIR:At3G45990)
(fig. 6A). Interestingly, TAIR10 Gbrowse
(http://gbrowse.arabidopsis.org/cgi-bin/gbrowse/arabidopsis/)
revealed that no expressed sequence tag was found for this
gene. A gene prediction program (Stanke and Morgenstern 2005)
predicted a 1.1-kb gene from the revised Ler-0 sequence that
was subsequently validated by PCR amplification and
Sanger sequencing. TAIR BLASTP results of the putative
Ler-0 allele suggest that it is an ACTIN-DEPOLYMERIZING
FACTOR 4-like gene (fig. 6B). In addition, insertions within
genes can also lead to the formation of inferred new transcripts,
for example, an intronless variant (fig. 5C). A further
noteworthy observation is the insertion in the third intron of
At3G46060 (supplementary fig. 4A, Supplementary Material
online) encoding a GTP-binding protein involved in ethylene
signaling (Zimmerli et al. 2008).

Fig. 5. SVs that overlap genes. (A and B) Copia-like retrotransposon sequences inserted in the corresponding Ler-0 allele. (C) A 10-bp
insertion within a gene resulted in an inferred intronless transcript variant. (D) A deletion and (E) a transposon-like insertion in regulatory
regions. The Augustus program (Stanke and Morgenstern 2005) was used to predict coding sequences (CDS) of the Ler-0 alleles.



SVs occurring in regulatory regions can influence gene
expression through numerous positional effects (Feuk
et al. 2006). Deletion of regulatory elements (fig. 5D) or
insertion within such elements (fig. 5E) might affect expression
of the immediate downstream gene and also the
successive gene if both genes share the same cis-regulatory
elements (Cordaux and Batzer 2009). In the third category,
a high number of SVs were found in a region enriched with
genes that encode zinc-binding proteins (fig. 7). Col-0 has
two transposable element genes within this region, and we
hypothesize that additional transposons are present in the
unamplified gaps (data not shown).

Fig. 6. Large deletion within a putative Col-0 gene suggests the formation of a novel allelic variant in Ler-0. (A) An 812-bp deletion is
located within a Col-0 cofilin/tropomyosin-type actin-binding gene (TAIR:At3G45990). The Augustus program predicted a 1.1-kb gene model from
the Ler-0 allele, which differs from the 3.4-kb Col-0 gene. (B) BLASTP revealed that the protein encoded by the Ler-0 allele has 45% pairwise amino acid
identity to the known ACTIN-DEPOLYMERIZING FACTOR 4 (ADF4) protein encoded by (TAIR:At5G59890). Identical amino acid motifs are highlighted in
black and similar motifs in gray.

Fig. 7. Zinc-binding protein gene cluster is enriched in SVs. Diagram illustrates the presence of large indels (red blocks) in a region where
a cluster of zinc-binding protein genes (yellow arrows) is located. Gray blocks, brown arrows, and green arrows indicate unamplified regions,
transposable element genes, and other protein-coding genes, respectively.

Our findings confirm that transposable elements do not 
merely cause genetic perturbations; they participate in gene 
regulatory networks in ways that SNPs could not achieve 
(Heard et al. 2010). In this work, fundamental challenges 
in SV detection were tackled using an LR-PCR-based rese- 
quencing approach that yielded valuable read positional 
information. Genome assemblies could be improved to
show SVs if the experiment is planned in a way that incorporates
positional information into the reads. This work
emphasizes the importance of detecting SVs, as they can
have significant implications for downstream biological
inferences, particularly the identification and study
of evolutionarily shared allelic variants.

Discussion 

Many agree that the real excitement in whole-genome- 
sequencing experiments only starts when another genome 
of a closely related individual is sequenced (Ossowski et al. 
2008; Hoberman et al. 2009; McKernan et al. 2009). The 
human 1000 genomes project (Collins et al. 2003) and 
the Arabidopsis 1001 genomes project (Weigel and Mott 
2009) are two examples of joint international collaborations 
to create a catalogue of intraspecific genetic variations. To- 
gether with the rapid advancement in NGS technology and 
the reduction in sequencing costs, there has been a massive 
proliferation in the number of drafted genomes produced. 
However, the inability of current assembly programs to address
problematic areas has resulted in the generation of
many poor-quality drafts (Chain et al. 2009). Capturing large
genomic SVs has been particularly challenging (Chen et al. 
2009; Kidd et al. 2010). 

In an attempt to identify problematic regions and find 
methods for improving draft genomes, we performed a re- 
sequencing experiment at a selected genomic interval of the 
Arabidopsis Ler-0 accession. Fine-scale sequence analysis at 
this target locus suggests that A. thaliana Ler-0 and Col-0 ge- 
nomes are highly variable. From our analysis, it was observed 
that the Ler-0 draft sequence accurately incorporates Col-0/Ler-0
polymorphisms if they are short and/or located in regions
that are not conserved, duplicated, or
repetitive. In contrast, large SVs that lie in conserved,
duplicated, or repetitive regions, such as variations in gene
families and transposon-like indels, were not incorporated
in the Ler-0 draft. Nevertheless, those SVs may affect gene
integrity and expression. Over 700 indels (supplementary ta- 
ble 1, Supplementary Material online) between Ler-0 and 
Col-0 and 15 large indels (figs. 5, 6, and 7; supplementary 
fig. 4, Supplementary Material online) present in genes and 
regulatory regions were identified. Seven of these indels in- 
volve transposon-like sequences. Although once thought to 
be "junk" DNA, an increasing number of studies have shed 
new light on the functional role of these jumping genes 
(Lippman et al. 2004; Wheelan et al. 2005). Transposons 
represent a dynamic portion of genomes, where some 
can mediate rearrangements of adjacent DNA (Bennetzen 
2005), present new regulatory effects on nearby genes 
(Michaels et al. 2003; Blewitt et al. 2005; Weil and Martienssen
2008; Lisch 2009), and contribute to gene expression divergence
between closely related species (Hollister et al. 2011).
The presence of transposons could also affect recombina- 
tion in adjacent genes by heterochromatic effects (He and 
Dooner 2009). 

Our results also suggest the occurrence of a potential 
synteny break (Al-Shahrour et al. 2010) between Ler-0
and Col-0 within a zinc-binding protein gene cluster
(fig. 7). Ten large indels that include three transposon-like 
insertions, one transposon deletion in Ler-0, and three un- 
amplified gaps further imply that this neighborhood has 
been dynamically reorganized in Ler-0. Indeed, functional 
clusters in mammals are significantly enriched by SINE ele- 
ments as they contribute to the rearrangement process 
(Zhao et al. 2004). The prevalence of SVs in genic regions
can potentially lead to the formation of natural allelic var- 
iants or alter gene expression and function altogether. Thus, 
it is imperative for drafted genomes to incorporate SVs in 
both coding and noncoding regions so that accurate 
biological and evolutionary inferences can be drawn from 
comparative genomics studies on closely related individuals. 

Positional Information Allows Correct Assembly of 
SVs 

Whole-genome sequencing is now a routine practice,
thanks to the advancements in sequencing technology.
Assembling large and complex genomes is unfortunately
a less straightforward task. For example, it is particularly
challenging to deduce large insertions in nonreference ac- 
cessions, variations within conserved or duplicated regions, 
and variations in microsatellite repeat lengths. Moreover, if 
the reference accession has a reduced genome (Schmuths 
et al. 2004), it can significantly impair insertion-based SV de- 
tection in nonreference accessions (supplementary fig. 5, 
Supplementary Material online). Using a combination of
wet lab and dry lab approaches, we demonstrated the feasibility
of resolving regions that deviate markedly from
the reference genome. Amplicon size information was
employed to identify the location of large SVs. Once the
approximate location had been identified, it was narrowed down
to the point where the variation starts by looking for misassembly
signatures. Local iterative read mapping was then
performed to resolve the variation in question, and the
length of the newly deduced sequence was compared
with its respective amplicon size. Algorithms for iterative
gap closure have been described elsewhere (Tsai et al.
2010). However, these algorithms detect gaps in assemblies
and are not suitable for insertions that do not manifest as
assembly gaps (supplementary fig. 2B, Supplementary Material
online).
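
The two uses of amplicon size described above, locating large SVs and then checking the resolved sequence, can be sketched as follows. This is an illustration under stated assumptions and not the authors' implementation: the `Amplicon` class, both function names, the 100-bp threshold (chosen to match the >100 bp definition of a large indel), the 10% tolerance, and all sizes in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Amplicon:
    name: str
    expected_bp: int  # size predicted from the reference (Col-0) sequence
    observed_bp: int  # size estimated for the study accession (Ler-0)


def flag_large_svs(amplicons, min_diff_bp=100):
    """Return (name, kind, size) for amplicons whose size deviates from the reference."""
    calls = []
    for amp in amplicons:
        diff = amp.observed_bp - amp.expected_bp
        if abs(diff) >= min_diff_bp:
            kind = "insertion" if diff > 0 else "deletion"
            calls.append((amp.name, kind, abs(diff)))
    return calls


def assembly_matches_amplicon(assembled_bp, observed_bp, tolerance=0.1):
    """Check a resolved sequence length against the amplicon size (relative tolerance)."""
    return abs(assembled_bp - observed_bp) <= tolerance * observed_bp


# Hypothetical amplicons; the sizes are made up for illustration.
amplicons = [
    Amplicon("amp1", expected_bp=8000, observed_bp=8050),
    Amplicon("amp2", expected_bp=9000, observed_bp=8188),
    Amplicon("amp3", expected_bp=7000, observed_bp=9500),
]
calls = flag_large_svs(amplicons)
```

Here `amp2` and `amp3` would be flagged for closer inspection, and a sequence resolved by iterative mapping would be accepted only if its length agrees with the observed amplicon size. A relative tolerance is used because amplicon sizes read from a gel are approximate.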

Conserved or duplicated regions can affect variant detec- 
tion, for example, large deletions in conserved regions, 
transposon-like indels, and polymorphisms within gene 
families. Santuari and colleagues have recently demon- 
strated the combined use of tiling array hybridizations with 
NGS to detect large deletions by identifying regions that 
have weak hybridization signals along with the absence 
of short reads (Santuari and Hardtke 2010; Santuari et al. 
2010). Here we show that fine-scale manual inspection 
can resolve regions that are conserved, duplicated, or repet- 
itive. Information contained in a single read is significantly 
limited by its length and can result in ambiguous placement 
of reads to homologous regions (Young et al. 2010). Aberrant
alignments of homologous reads may inflate the
number of false-positive detections (Pool et al. 2010). In
particular, we observed ambiguous placements of transposons
in the Ler-0 draft. Although deletions are easier to detect, 
we have nevertheless identified large deletions absent from 
the Ler-0 draft. Deletions that lie in conserved regions will be
missed (false negatives) because reads from homologous regions
can map to the reference sequence even though that sequence is
absent from the study accession. Because our work was targeted to
a specific locus, regions that are duplicated elsewhere in 
the genome will not interfere with the iterative mapping 
step, unless a particular region is duplicated within the locus 
itself. Therefore, we emphasize the importance of having 
positional information that assists sorting of reads to their 
respective locations and allows the resolution of duplica- 
tions independently without interference from other 
homologous sequence reads. 
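
A minimal sketch of how positional information sorts reads is given below. It assumes that every read carries a label for the amplicon (or clone pool) it was generated from; the label format and the function name are hypothetical. Identical read sequences that originate from duplicated regions then fall into separate bins and can be assembled independently.

```python
from collections import defaultdict


def bin_reads_by_amplicon(tagged_reads):
    """Group read sequences by the amplicon label they were generated from."""
    bins = defaultdict(list)
    for amplicon_id, sequence in tagged_reads:
        bins[amplicon_id].append(sequence)
    return dict(bins)


# The same sequence occurs in two amplicons (a duplicated region), but the
# positional label keeps the two copies apart.
tagged_reads = [
    ("amp07", "ACGTACGTAC"),
    ("amp21", "ACGTACGTAC"),
    ("amp07", "GTACGGTTCA"),
]
bins = bin_reads_by_amplicon(tagged_reads)
```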



Previously, an indel prediction has been performed using 
the 2-fold redundant Ler-0 shotgun contigs generated by 
Cereon Genomics (Ziolkowski et al. 2009). Thirteen out 
of the 19 predicted indels that fall within the locus of interest 
were found to be false positives, the largest being a 7.8-kb 
insertion. The high rate of false-positive predictions can be 
attributed to the assignment of incorrect chimeric Cereon 
contigs (Lai AG, Dijkwel PP, unpublished data) that have par- 
tial sequence homology to a particular region. The incorrect 
placement of contigs is therefore exacerbated by the ab- 
sence of positional information. 

Another challenge in whole-genome assembly is the accurate
deduction of microsatellite repeat lengths from
short-read data sets (supplementary table 1E, Supplemen- 
tary Material online). In theory, paired-end mapping should 
mitigate this problem if the gap spanned by the paired reads 
is larger than the repeat itself. Most paired-end libraries, 
however, lack sufficient coverage to enable reliable 
sequence predictions (Schatz et al. 2010). Therefore, posi- 
tional information is useful for the sorting of repetitive 
sequences to their respective genomic locations. Further- 
more, accurate deduction of repeat length is crucial in order 
to reveal rare allelic variants (Sureshkumar et al. 2009). In 
short, significant progress can be made on genome assem- 
bly if the experimental design prior to sequencing is 
modified such that positional information is incorporated in- 
to data sets. 
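
The paired-end condition mentioned above can be written as a simple check. The sketch assumes that both reads of a pair must lie entirely in the unique flanks of the repeat, so the unsequenced inner gap has to be at least as long as the repeat; the function names and example sizes are hypothetical.

```python
def inner_gap(fragment_len, read_len):
    """Length of the unsequenced stretch between the two reads of a pair."""
    return fragment_len - 2 * read_len


def pair_can_bracket_repeat(fragment_len, read_len, repeat_len):
    """True if a read pair can sit in the flanks with the whole repeat in between."""
    return inner_gap(fragment_len, read_len) >= repeat_len


# A 300-bp fragment sequenced with 75-bp reads leaves a 150-bp inner gap.
short_repeat_ok = pair_can_bracket_repeat(300, 75, 120)
long_repeat_ok = pair_can_bracket_repeat(300, 75, 200)
```

With these example sizes a 120-bp repeat can be bracketed but a 200-bp repeat cannot, and even a bracketing pair yields a reliable length estimate only when enough such pairs cover the locus.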

Genes Associated with SVs May Evolve New 
Functions 

The organization of SVs has two implications for genome
evolution. First, structural changes can be observed in
regions that have high rates of evolutionary turnover.
Second, genes that are duplicated or transposed
to new chromosomal regions are freed from selective constraints
and can evolve independently, giving rise to genes with
altered functions or altered regulation (Samonte and Eichler
2002). The most common type of SV that affects genes is
segmental duplication, where a likely outcome would be
the accumulation of partial gene structures or pseudogenes 
(Lynch and Conery 2000; Zhang 2003). These paralogous 
genomic copies have been often treated as "dead on ar- 
rival." Recent studies on whole-genome tiling analysis, how- 
ever, revealed that pseudogenes can be expressed (Akama 
et al. 2009). Expressed pseudogenes also play a role in
regulating the messenger RNA stability of their homologous
coding genes (Hirotsune et al. 2003). Gene density correlates
strongly with segmental duplication density, and in comparison
with unique genes, genes in segmentally duplicated regions
are more likely to display inter- and intraspecific CNV
(Tuzun et al. 2005) along with signatures of positive selection
(Johnson et al. 2001; Birtle et al. 2005). Although
genes affected by SVs are most likely associated with subtle
phenotypic alterations due to selective constraints, they can
nevertheless have an influence on the phenotype by altering
gene dosage (Sharp et al. 2006). Genes involved in environmental
interaction and host defense have been found to be
enriched with SVs (Emes et al. 2003; Tuzun et al. 2005). Examining
structurally dynamic regions of the genome may
provide clues on lineage-specific adaptation patterns (Emes
et al. 2003; Sharp et al. 2006) that are under diversifying
positive selection pressure.

Harnessing Positional Information to Boost 
Comparative Genomics 

Our work suggests that positional information is important 
for obtaining reliable ordering of scaffolds on chromosomes 
and improving genome assembly to unveil dynamic genome 
architectures. Likewise, the development of high-resolution
physical maps (Lewin et al. 2009) is indispensable to the
ordering of contigs in whole-genome alignments and also 
for the discovery of evolutionary break point regions based 
on comparative physical maps (Larkin et al. 2009). A com- 
parison between two forms of genome assembly, that is, 
hierarchical sequencing of large insert clones and whole- 
genome shotgun sequence assembly (WGSA) of reads, 
revealed that the WGSA method yields a 20-Mb shorter se- 
quence than the clone-based assembly (Marques-Bonet 
et al. 2009). This length discrepancy is caused by the failure
of many whole-genome shotgun reads to map to a locus 
containing a highly duplicated and rapidly evolving gene 
family (Johnson et al. 2006). This problem will be further 
aggravated when significantly shorter NGS reads are used 
(Marques-Bonet et al. 2009). 

A fail-proof method that accurately detects SVs is still
lacking. We envisage that genome-sequencing experiments
will proceed in a clone-based manner that allows the
incorporation of positional information into the generated
reads. This technique is comparable to the "first-map, then
sequence" strategy that uses a BAC-based scaffolding method 
(Kuhl et al. 2010), which has been successfully implemented 
in various sequencing projects (Fujiyama et al. 2002; Larkin 
et al. 2009; Lewin et al. 2009). Construction of large DNA in- 
sert libraries will be useful for genome-sequencing projects. 
This form of genome partitioning will undoubtedly require 
more work than generating reduced representation libraries 
from restriction digestions (Young et al. 2010). Although reduced
representation libraries can simplify assembly and potentially
yield larger contigs, they lack the positional information
required to tease out duplicated regions.

With the current capacity, an entire genome can be se- 
quenced on a single flow cell by making pools of large insert 
clones and subsequently multiplexing these pools. These reads 
will have positional information and can then be assigned to 
their corresponding genomic intervals where de novo contigs 
can subsequently be generated from these region-specific
reads. Using the combinatorial pooling and multiplexing strategy,
tens of thousands of different samples can be analyzed
with only several hundred appended barcodes (Erlich et al. 
2009). Different levels of multiplexing can also be performed 
to achieve the desired resolution based on resource availability 
(Wood et al. 2010). With such positional information avail- 
able, it is possible to elucidate more complex forms of poly- 
morphisms that include segmental duplications, transversions, 
and transposition events. The size of each clone can be used to 
validate the accuracy of the assembled contigs. Indeed, sev- 
eral groups have started to follow the clone-based sequencing 
approach at a low-resolution scale in order to capture a more 
representative depiction of large intraspecific variations (Kidd 
et al. 2008; Hurwitz et al. 2010). 
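
Why a few hundred barcodes can suffice for tens of thousands of samples is illustrated by the sketch below. It uses a plain two-dimensional row and column pooling layout as a simplified stand-in and is not the combinatorial scheme of Erlich et al. (2009); the function names are hypothetical. Each clone is placed in one row pool and one column pool, so the pair of barcodes identifies the clone.

```python
import math


def grid_pools(n_clones):
    """Assign each clone to a (row pool, column pool) pair on a square grid.

    Returns the assignment and the grid side; the design needs 2 * side barcodes.
    """
    side = math.isqrt(n_clones - 1) + 1  # smallest side with side * side >= n_clones
    assignment = {clone: (clone // side, clone % side) for clone in range(n_clones)}
    return assignment, side


def decode(row_pool, col_pool, side):
    """Recover the clone index from the row and column pools a read was seen in."""
    return row_pool * side + col_pool


assignment, side = grid_pools(10_000)
n_barcodes = 2 * side  # 100 row barcodes plus 100 column barcodes
```

In this layout 10,000 clones need only 200 barcodes. Decoding is unambiguous only for sequences found in a single clone, which is one reason more elaborate pooling designs are used in practice.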

NGS platforms have been widely used in targeted 
resequencing experiments on selected genomic intervals 
(Martinez Barrio et al. 2009; Turner et al. 2010). Resequencing
experiments are often required for the study of intraspecific 
polymorphisms in regions suspected to host vast amounts 
of variations. Discovering beneficial or heterotic genetic traits 
in crop species is primarily performed using a quantitative trait 
loci (QTL) mapping strategy. Because reference genomes per 
se may not contain the locus of interest, our approach can 
successfully identify such SV-related QTL. A majority of drafted 
genomes fail to provide sufficient granularity for comparative 
genomics in this sense. Hence, most studies on haplotypic var- 
iants still rely on clone-based Sanger shotgun sequencing 
(Alcazar et al. 2009; Heuer et al. 2009). An undistorted view 
of data quality is important for end users; hence, each ge- 
nome should be independently assembled to reveal haplotypic 
differences. Furthermore, the accuracy of downstream gene 
annotations relies on the fidelity of the initial assembly.
Annotations derived from a flawed assembly will not only mislead
end users but also defy the initial justification of comparative
genomics. Elucidation of complex and dynamic regions of the genome should
be the end goal of NGS projects apart from cataloguing 
small variations such as SNPs. The full benefit of compar- 
ative genomics can only be realized when high-quality ge- 
nome sequences are available. 

Supplementary Material 

Supplementary figures 1-5 and tables 1 and 2 are available
at Genome Biology and Evolution online
(http://www.gbe.oxfordjournals.org/).